The Importance of Window Length in Splice Site Prediction
Abstract
The performance of gene prediction programs depends strongly on the methods they use to locate splice sites. Several pattern recognition techniques are available for assessing the quality of candidate splice sites; see [1] for an overview and further references. All of these techniques compute a score from the distribution of nucleotides in the neighbourhood of a splice site consensus sequence. The scores are normally obtained from splice site models estimated on large training sets of example neighbourhoods. The training sets may also include negative examples, i.e. sequences that contain the consensus sequence but are not actual splice sites. Unfortunately, the concept of ‘neighbourhood’ is ambiguous, and there is no general recommendation about which nucleotide positions should be included in the calculation, i.e. which analysis window should be employed. The window length is an important parameter because it determines the amount of information that has to be evaluated. On the one hand, the window should be long enough to capture as much detail as possible about the patterns. On the other hand, it should be short enough to include only the relevant information, which improves generalization. In the present study, we investigate how splice site prediction accuracy depends on the window size and shape, using support vector machines (SVMs) [2]. Our results show that the choice of window is crucial for splice site prediction, and we therefore suggest that the window length be treated as an essential parameter of the model.
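To make the notion of an analysis window concrete, the following is a minimal sketch of how candidate windows around a consensus dinucleotide might be extracted and turned into feature vectors for a classifier such as an SVM. It is not the procedure used in the study: the function names `candidate_windows` and `one_hot`, the default `GT` consensus, and the `upstream`/`downstream` parameters are all assumptions chosen for illustration.

```python
from typing import List, Tuple

NUCLEOTIDES = "ACGT"


def candidate_windows(seq: str, consensus: str = "GT",
                      upstream: int = 3, downstream: int = 4) -> List[Tuple[int, str]]:
    """Return (position, window) for each consensus occurrence with a full window.

    `upstream` and `downstream` are the numbers of nucleotides kept before and
    after the consensus; unequal values give an asymmetric window shape.
    (Hypothetical helper, not taken from the study.)
    """
    seq = seq.upper()
    windows = []
    start = seq.find(consensus)
    while start != -1:
        lo = start - upstream
        hi = start + len(consensus) + downstream
        # Skip candidates whose window would run past either end of the sequence.
        if lo >= 0 and hi <= len(seq):
            windows.append((start, seq[lo:hi]))
        start = seq.find(consensus, start + 1)
    return windows


def one_hot(window: str) -> List[int]:
    """Encode each base as four 0/1 indicators in A, C, G, T order.

    Any other symbol (e.g. N) becomes four zeros.
    """
    vec: List[int] = []
    for base in window:
        vec.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vec


if __name__ == "__main__":
    for pos, win in candidate_windows("AAAGGTAAGTCC", upstream=1, downstream=1):
        print(pos, win, one_hot(win))
```

With this encoding the feature vector has 4 × (upstream + 2 + downstream) entries for a dinucleotide consensus, so lengthening the window increases the amount of information a model has to evaluate, which is the trade-off described above.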
Similar resources
Pre-mRNA Secondary Structure Prediction Aids Splice Site Prediction
Accurate splice site prediction is a critical component of any computational approach to gene prediction in higher organisms. Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. We present evidence that computationally predicted secondary structure of moderate length pre-mRNA subsequences contains i...
Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA.
Prediction of splice site selection and efficiency from sequence inspection is of fundamental interest (testing the current knowledge of requisite sequence features) and practical importance (genome annotation, design of mutant or transgenic organisms). In plants, the dominant variables affecting splice site selection and efficiency include the degree of matching to the extended splice site con...
Identification of a Novel Splice Site Mutation in RUNX2 Gene in a Family with Rare Autosomal Dominant Cleidocranial Dysplasia
Introduction: Pathogenic variants of RUNX2, a gene that encodes an osteoblast-specific transcription factor, have been shown as the cause of CCD, which is a rare hereditary skeletal and dental disorder with dominant mode of inheritance and a broad range of clinical variability. Due to the relative lack of clinical complications resulting in CCD, the medical diagnosis of this disorder is challen...
Dataset Construction for Gene Structure Prediction and Alternative Splicing Analysis
The performance of gene finding from genome sequences strongly depends on the accuracy of splice site prediction. Recent gene finding programs, however, still do not reach sufficient accuracy. To improve the accuracy of splice site prediction, it is necessary to understand the splicing mechanism and to build a model from clear experimental evidence. For this purpose, genomic full-length precursor mRNA...